Evaluation of the Expressivity of a Swedish Talking Head in the Context of Human-machine Interaction

Authors

  • Jonas Beskow
  • Loredana Cerrato
Abstract

This paper describes a first attempt at synthesis and evaluation of expressive visual articulation using an MPEG-4 based virtual talking head. The synthesis is data-driven, trained on a corpus of emotional speech recorded using optical motion capture. Each emotion is modelled separately using principal component analysis and a parametric coarticulation model. In order to evaluate the expressivity of the data-driven synthesis, two tests were conducted. Our talking head was used in interactions with a human being in a given realistic usage context. The interactions were presented to external observers who were asked to judge the emotion of the talking head. The participants in the experiment could only hear the voice of the user, which was a pre-recorded female voice, and see and hear the talking head. The results of the evaluation, even if constrained by the limitations of the implementation, clearly show that the visual expression plays a relevant role in the recognition of emotions.

1. INTRODUCTION

In recent years, there has been an increased interest in animated characters in a diverse array of applications: web services, automated tutors for e-learning, avatars in virtual conferencing and computer game characters. The concept of embodied conversational agents (ECAs), animated agents that are able to interact with a user in a natural way using speech, gesture and facial expression, holds the potential of a new level of naturalness in human-computer interaction, where the machine is able to convey and interpret verbal as well as non-verbal communicative acts, ultimately leading to more robust, efficient and intuitive interaction. Audio-visual speech synthesis, i.e. the production of synthetic speech with properly synchronised movement of the visible articulators, is an important property of such agents that not only improves realism but also adds to the intelligibility of the speech output (cf. Siciliano et al., 2003). Previous work on visual speech synthesis (see Beskow, 2003 for an overview) has typically been aimed at modelling neutral pronunciation. However, as the agents become more advanced, the need for affective and expressive speech arises. This presents a new challenge in acoustic as well as in visual speech synthesis.

Several studies have shown how expressiveness and emotions affect our facial display, e.g. how we raise our eyebrows, move our eyes or blink, or nod and turn our head (Argyle & Cook, 1976; Ekman, 1982). Some recent results (Nordstrand et al., 2004; Magno Caldognetto, Cosi & Cavicchio, 2004) have also shown how articulation is affected by expressiveness in speech; in other words, articulatory parameters behave differently under the influence of different emotions. This paper describes recent experiments with synthesis of expressive emotional visual articulation in a virtual talking head. We have used a data-driven approach, training the system to recreate the dynamics of communicative and emotional facial expressions, in order to avoid the cartoon-like results that can sometimes arise from manually created expressions. By capturing the facial movement of humans we can gain valuable insight into how to control the synthetic agent's facial expressions. To this end, a multimodal corpus of expressive speech has been collected, and this paper presents the first results of the implementation and evaluation of data-driven synthesis of expressive speech in a synthetic conversational agent based on this corpus.
2. DATA COLLECTION

To gain knowledge about how to drive our talking head, in terms of expressive non-verbal and verbal behaviour, we have collected a multimodal corpus of emotional speech using an opto-electronic motion tracking system, Qualisys MacReflex¹, which allows capturing the dynamics of emotional facial expressions. For this study, our speaker, a male native Swedish amateur actor, was instructed to produce 75 short sentences with the six basic emotions (happiness, sadness, surprise, disgust, fear and anger) plus neutral. A total of 29 IR-reflective markers were attached to the speaker's face, of which 4 were used as reference markers (on the ears and on the forehead). The marker setup (shown in Figure 1) largely corresponds to the MPEG-4 feature point (FP) configuration. Thanks to the reflective markers it is possible to record the 3D position of each marker with sub-millimetre accuracy every 1/60 second, using four infrared cameras. Audio data was recorded on DAT tape and visual data was recorded using a standard digital video camera and the optical motion tracking system (Beskow et al., 2004).

[Figure 1: Marker placements on the speaker's face for the recording of the 75 sentences used to train the models for expressive articulation.]

¹ http://www.qualisys.se/ (March 2005)

3. TALKING HEAD MODEL

Our talking head is based on the MPEG-4 Facial Animation standard (Pandzic & Forchheimer, 2002). It is a textured 3D model of a male face comprising approximately 15,000 polygons (Figure 3). The MPEG-4 standard allows the face to be controlled directly by a number of parameters (FAPs, facial animation parameters). The FAPs specify the movements of a number of feature points in the face, and are normalized with respect to face dimensions so as to be independent of the specific face model. Thus it is possible to drive the face from points measured on a face that differs in geometry from the model.
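As an illustration of this normalization, the sketch below converts a displacement measured on the recorded speaker into a dimensionless FAP value using that face's animation parameter units (FAPUs), and maps it back onto the synthetic head using the model's own FAPUs. It is a minimal sketch only: the FAPU distances, the example displacement and the function names are invented for the example and are not taken from the system described here.

```python
# Minimal sketch of MPEG-4 FAP normalization via face animation parameter units (FAPUs).
# All distances, FAP choices and names below are illustrative, not measurements from the paper.

def fapus(mouth_nose_sep_mm: float, mouth_width_mm: float) -> dict:
    """FAPUs are face-specific distances divided by 1024 (e.g. MNS, MW)."""
    return {"MNS": mouth_nose_sep_mm / 1024.0,
            "MW": mouth_width_mm / 1024.0}

# Hypothetical FAPUs for the recorded speaker and for the synthetic head.
speaker = fapus(mouth_nose_sep_mm=32.0, mouth_width_mm=55.0)
model = fapus(mouth_nose_sep_mm=28.0, mouth_width_mm=50.0)

def to_fap(displacement_mm: float, unit: str, face: dict) -> float:
    """Normalize a measured displacement into a dimensionless FAP value."""
    return displacement_mm / face[unit]

def to_model_mm(fap_value: float, unit: str, face: dict) -> float:
    """Convert a FAP value back into a displacement on the target face model."""
    return fap_value * face[unit]

# Example: a lower-lip feature point moves 4 mm down on the speaker.
fap = to_fap(4.0, "MNS", speaker)        # model-independent FAP value
disp = to_model_mm(fap, "MNS", model)    # displacement applied to the 3D model
print(f"FAP value: {fap:.1f}, model displacement: {disp:.2f} mm")
```

Because the FAP value is expressed in the source face's own units, the same value produces proportionally scaled motion on a face of different dimensions.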
4. DATA DRIVEN SYNTHESIS OF EXPRESSIVE SPEECH

In order to synthesise visual articulation synchronised with acoustic speech, we start from a time-aligned transcription of the acoustic signal: a string of phonemes with associated start and end times. This transcription can be obtained from a text-to-speech system, if we are synchronising with synthetic speech, or from a phonetic aligner such as NALIGN (Sjölander and Heldner, 2004) if we are dealing with pre-recorded speech. Given the phonetic specification, we need an algorithm that generates MPEG-4 parameter tracks that can be used to animate the face model. We will refer to this as the articulatory control model. In order to produce convincing and smooth articulation, the articulatory control model will have to model coarticulation, which refers to the way in which the realisation of a phonetic segment is influenced by neighbouring segments. The recorded corpus of expressive speech will be used to create (train) articulatory control models for each of the recorded emotions, so that each model learns to predict the observed patterns and is able to produce articulatory movements for novel (arbitrary) Swedish speech. Thus, expression and articulation are modelled in an integrated fashion.

4.1 Articulatory control model

We have adopted the coarticulation model of Cohen & Massaro (1993), which is based on Löfqvist's (1990) gestural theory of speech production. We have previously evaluated this model for data-driven modelling of neutral articulation and found it to perform well compared to other models (Beskow, 2004). Thus it was a natural choice also for the task of modelling expressive speech.

In the Cohen-Massaro model, each phonetic segment is assigned a target vector of articulatory parameters. The target values are then blended over time using a set of overlapping temporal dominance functions. The dominance functions take the shape of a pair of negative exponential functions, one rising and one falling. The height of the peak, the rate at which the dominance rises and falls, and the shape of the slope (exponent) are free parameters that can be adjusted independently for each phoneme and articulatory control parameter. The trajectory of a parameter z(t) can be calculated as
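$$ z(t) = \frac{\sum_{s} D_s(t)\,T_s}{\sum_{s} D_s(t)}, \qquad D_s(t) = \begin{cases} \alpha_s \exp\!\left(-\theta^{\uparrow}_{s}\,|t - t_s|^{c_s}\right), & t \le t_s \\ \alpha_s \exp\!\left(-\theta^{\downarrow}_{s}\,|t - t_s|^{c_s}\right), & t > t_s \end{cases} $$

where the sum runs over the phonetic segments s, T_s is the target value of the parameter for segment s, t_s is the temporal centre of the segment, α_s is the peak height of the dominance function, θ↑_s and θ↓_s control the rate of rise and fall, and c_s is the exponent shaping the slope. This is the general weighted-average form of the Cohen & Massaro (1993) model; the symbol conventions here follow that general form.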

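The following sketch illustrates how such a control model can turn a time-aligned transcription into a parameter track: each segment contributes its target through a negative-exponential dominance function, and the targets are blended as a dominance-weighted average, in the spirit of the formulation above. The phoneme targets, dominance parameters and frame rate are invented for the example; in the system described here they would be estimated per emotion from the motion-capture corpus.

```python
# Illustrative sketch of dominance-blended target interpolation (Cohen & Massaro style).
# All phoneme targets and dominance parameters are made up for the example.
import math
from dataclasses import dataclass

@dataclass
class Segment:
    phone: str
    start: float          # seconds
    end: float            # seconds
    target: float         # target value for one articulatory parameter (e.g. a FAP)
    alpha: float = 1.0    # peak height of the dominance function
    theta_rise: float = 20.0  # rate of rise (1/s)
    theta_fall: float = 20.0  # rate of fall (1/s)
    c: float = 1.0            # exponent shaping the slope

    @property
    def centre(self) -> float:
        return 0.5 * (self.start + self.end)

    def dominance(self, t: float) -> float:
        """Negative-exponential dominance, rising before the segment centre and falling after."""
        tau = t - self.centre
        theta = self.theta_rise if tau <= 0 else self.theta_fall
        return self.alpha * math.exp(-theta * abs(tau) ** self.c)

def trajectory(segments, fps=60.0):
    """Dominance-weighted average of segment targets, sampled at the animation frame rate."""
    t_end = max(s.end for s in segments)
    track, t = [], 0.0
    while t <= t_end:
        num = sum(s.dominance(t) * s.target for s in segments)
        den = sum(s.dominance(t) for s in segments)
        track.append(num / den if den > 0 else 0.0)
        t += 1.0 / fps
    return track

# Hypothetical time-aligned transcription for one articulatory parameter.
segments = [
    Segment("s", 0.00, 0.10, target=0.2),
    Segment("e", 0.10, 0.25, target=0.8),
    Segment("j", 0.25, 0.35, target=0.4),
]
print(trajectory(segments)[:5])
```

Because every segment's dominance extends beyond its own boundaries, neighbouring targets influence each frame's value, which is how the model captures coarticulation.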
Publication date: 2005